Prepare Data
============

On this page, we introduce the functions provided for loading datasets and splitting data.

Load Data
---------
In ``s3l.datasets.base``, we provide some useful functions to load data. Here is the list:

::

    'load_data',
    'load_dataset',
    'load_graph',
    'load_boston',
    'load_diabetes',
    'load_digits',
    'load_iris',
    'load_breast_cancer',
    'load_linnerud',
    'load_wine',
    'load_ionosphere',
    'load_australian',
    'load_bupa',
    'load_haberman',
    'load_vehicle',
    'load_covtype',
    'load_housing10',
    'load_spambase',
    'load_house',
    'load_clean1'


Among them, ``load_data``, ``load_dataset`` and ``load_graph`` load data that you prepare yourself. The other functions load built-in datasets that are commonly used by researchers. These functions return the data in a form that estimators can use directly. For example,

.. code:: python

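    # Replace XXX with the name of a built-in dataset, e.g. load_iris.
    # return_X_y=True is assumed to follow the scikit-learn convention
    # of returning a (data, target) tuple.
    X, y = load_XXX(return_X_y=True)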

We now show how to use the three user-oriented functions ``load_data``, ``load_dataset`` and ``load_graph``. ``load_dataset`` is called directly by the experiment classes; you can use these functions when you try algorithms outside an experiment class or when you implement your own experiment class.

``load_data`` loads features and labels of a dataset given the file names.

.. code:: python

    X, y = load_data(feature_file, label_file)
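
The on-disk format that ``load_data`` expects is not described on this page. As a rough illustration only, the sketch below shows the general idea of reading features and labels from two separate files, assuming plain CSV with one sample per row. ``read_feature_label_csv`` is a hypothetical stand-in written with the standard library; it is not part of ``s3l``.

.. code:: python

    import csv
    import os
    import tempfile


    def read_feature_label_csv(feature_file, label_file):
        """Hypothetical stand-in for illustration; not the s3l implementation."""
        with open(feature_file, newline="") as f:
            X = [[float(v) for v in row] for row in csv.reader(f) if row]
        with open(label_file, newline="") as f:
            y = [int(row[0]) for row in csv.reader(f) if row]
        if len(X) != len(y):
            raise ValueError("features and labels must have the same number of rows")
        return X, y


    # Write two tiny files so that the sketch runs on its own.
    tmp_dir = tempfile.mkdtemp()
    feature_file = os.path.join(tmp_dir, "features.csv")
    label_file = os.path.join(tmp_dir, "labels.csv")
    with open(feature_file, "w", newline="") as f:
        csv.writer(f).writerows([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
    with open(label_file, "w", newline="") as f:
        csv.writer(f).writerows([[0], [1], [0]])

    X, y = read_feature_label_csv(feature_file, label_file)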

``load_dataset`` wraps ``load_data`` with an additional parameter *name* and loads the built-in dataset if *name* matches one.

.. code:: python

    X, y = load_dataset(name, feature_file, label_file)
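
The name-based dispatch described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the registry ``BUILTIN_LOADERS``, the stub loader, the ``file_loader`` argument and the function ``load_dataset_sketch`` are all hypothetical names introduced for illustration, and the real lookup inside ``s3l`` may work differently.

.. code:: python

    def load_iris_stub():
        """Placeholder for a built-in loader; returns a tiny made-up dataset."""
        return [[5.1, 3.5], [6.2, 2.9]], [0, 1]


    # Hypothetical registry mapping dataset names to built-in loaders.
    BUILTIN_LOADERS = {"iris": load_iris_stub}


    def load_dataset_sketch(name, feature_file=None, label_file=None,
                            file_loader=None):
        """Return a built-in dataset if name matches, else load from files."""
        if name in BUILTIN_LOADERS:
            return BUILTIN_LOADERS[name]()
        if file_loader is None or feature_file is None or label_file is None:
            raise ValueError("unknown dataset name and no files to load from")
        return file_loader(feature_file, label_file)


    # A matching name ignores the file arguments.
    X, y = load_dataset_sketch("iris")

    # A non-matching name falls back to the file-based loader (faked here).
    X2, y2 = load_dataset_sketch(
        "my_data", "features.csv", "labels.csv",
        file_loader=lambda feature_file, label_file: ([[1.0]], [0]))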

``load_graph`` loads a graph from a ``.csv``, ``.npz`` or ``.mat`` file and returns it as a matrix.

.. code:: python

    W = load_graph(graph_file)


Split Data
----------
In ``s3l.datasets.data_manipulate``, we provide some useful functions to split data. Here is the list:

::

    'inductive_split',
    'ratio_split',
    'cv_split'

Among them, ``inductive_split`` splits the dataset into three parts: a labeled set, an unlabeled set and a test set, which is helpful for semi-supervised learning tasks.

.. code:: python

    from sklearn.datasets import make_classification
    from s3l.datasets import data_manipulate

    X, y = make_classification()
    train_idx, test_idx, label_idx, unlabel_idx = data_manipulate.inductive_split(
        X, y, test_ratio=0.3, initial_label_rate=0.05, split_count=10)
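
To make the three-way split more concrete, here is a minimal sketch of the idea using only the standard library. It is not the ``s3l`` implementation: ``inductive_split_sketch`` is a hypothetical helper, and it assumes that the labeled and unlabeled sets together form the training part, that ``initial_label_rate`` is taken relative to that training part, and that every result holds one index list per repetition.

.. code:: python

    import random


    def inductive_split_sketch(n_samples, test_ratio=0.3,
                               initial_label_rate=0.05, split_count=10, seed=0):
        """Hypothetical illustration of a labeled/unlabeled/test split."""
        rng = random.Random(seed)
        train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
        for _ in range(split_count):
            perm = list(range(n_samples))
            rng.shuffle(perm)
            n_test = round(n_samples * test_ratio)
            test, train = perm[:n_test], perm[n_test:]
            # Assumption: the label rate applies to the training part only.
            n_label = max(1, round(len(train) * initial_label_rate))
            test_idx.append(test)
            train_idx.append(train)
            label_idx.append(train[:n_label])
            unlabel_idx.append(train[n_label:])
        return train_idx, test_idx, label_idx, unlabel_idx


    train_idx, test_idx, label_idx, unlabel_idx = inductive_split_sketch(100)

In this sketch the test indexes never overlap with the training indexes, and the labeled and unlabeled indexes partition the training part of each repetition.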

``ratio_split`` and ``cv_split`` split the given data by a train/test ratio and by k-fold cross-validation, respectively.

.. code:: python

    from sklearn.datasets import make_classification
    from s3l.datasets import data_manipulate

    X, y = make_classification()
    # ratio_split
    train_idx, test_idx = data_manipulate.ratio_split(
        X, y, unlabel_ratio=0.3, split_count=10)

    # cv_split
    train_idx, test_idx = data_manipulate.cv_split(
        X, y, k=3, split_count=10)

The returned ``*_idx`` values are lists of indexes that can be used directly by the built-in estimators.
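
If you want to select rows yourself, the sketch below builds k-fold index lists with the standard library and uses one of them to slice the data. It assumes one index list per fold, which may differ from what ``cv_split`` actually returns; ``kfold_indexes_sketch`` is a hypothetical helper and not part of ``s3l``.

.. code:: python

    import random


    def kfold_indexes_sketch(n_samples, k=3, seed=0):
        """Hypothetical helper: return k train and k test index lists."""
        perm = list(range(n_samples))
        random.Random(seed).shuffle(perm)
        # Deal the shuffled indexes into k folds of nearly equal size.
        folds = [perm[i::k] for i in range(k)]
        train_idx, test_idx = [], []
        for i in range(k):
            test_idx.append(sorted(folds[i]))
            train_idx.append(
                sorted(j for m in range(k) if m != i for j in folds[m]))
        return train_idx, test_idx


    X = [[float(i), float(i) * 2] for i in range(9)]
    y = [i % 2 for i in range(9)]
    train_idx, test_idx = kfold_indexes_sketch(len(X), k=3)

    # Select the rows of the first fold by index.
    X_train = [X[i] for i in train_idx[0]]
    y_train = [y[i] for i in train_idx[0]]
    X_test = [X[i] for i in test_idx[0]]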